LSTM AutoEncoder

気象情報をセンサデータと見立てて異常検知してみる。入力は気象庁の過去の天気情報で入手。1980年から2020年(06月まで)の毎日の天気情報。

In [1]:
import pandas as pd
import numpy as np

import os
from tqdm import tqdm

import tensorflow as tf
import pandas as pd
pd.options.mode.chained_assignment = None
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline
sns.set(style='whitegrid',palette ='muted')
rcParams['figure.figsize']=14, 8
np.random.seed(1)
tf.random.set_seed(1)
print("TensorFlow Version: ",tf.__version__)
TensorFlow Version:  2.2.0

データの準備

予め超かんたんなクレンジングを実行済み。input配下にCSVが入れてある状況。
サンプルとして一番初めの月だけ見てみる。その後全期間のデータをマージ、風向きデータのOneHotが面倒なので可視化後に除外。数値データだけ残して正規化する。
※最高気温、最低気温、平均気温など相関の強いカラムが多いので本来はカラムを絞ったり次元削減するべきだが、とりあえずのモデルなのでそこまではこだわらない。

In [2]:
INPUT_DIR = 'input/'
In [3]:
csv_files = os.listdir(INPUT_DIR)
csv_files.sort()
In [4]:
# とりあえず初めの月だけ表示してみる
input_path = os.path.join(INPUT_DIR, csv_files[0])
df = pd.read_csv(input_path)
df
Out[4]:
date precipitation_sum precipitation_max_hour precipitation_max_10minute temperature_mean temperature_max temperature_min wind_velocity_mean wind_velocity_max wind_velocity_max_direction maximum_instantaneous_wind_speed maximum_instantaneous_wind_speed_direction most_common_wind_direction solar_irradiation_hour snowfall snow_accumulation
0 1980/1/1 0.0 0.0 0.0 6.8 8.6 3.6 3.0 6.0 NNE 0.0 NaN NE 5.8 0.0 0.0
1 1980/1/2 3.0 2.0 0.0 5.3 8.2 2.1 2.3 6.0 NE 0.0 NaN NNE 5.7 0.0 0.0
2 1980/1/3 26.0 4.0 0.0 6.7 9.7 3.9 1.6 3.0 NNE 0.0 NaN NNE 0.1 0.0 0.0
3 1980/1/4 20.0 8.0 0.0 11.6 17.9 7.5 2.9 7.0 NE 0.0 NaN E 5.3 0.0 0.0
4 1980/1/5 3.0 2.0 0.0 5.1 8.4 2.4 3.2 8.0 NE 0.0 NaN NE 3.5 0.0 0.0
5 1980/1/6 0.0 0.0 0.0 7.5 12.1 1.6 4.0 9.0 SW 0.0 NaN SW 8.3 0.0 0.0
6 1980/1/7 0.0 0.0 0.0 7.8 12.2 0.4 6.4 12.0 WSW 0.0 NaN WSW 7.8 0.0 0.0
7 1980/1/8 0.0 0.0 0.0 2.8 5.9 0.0 1.6 4.0 NE 0.0 NaN NNE 9.2 0.0 0.0
8 1980/1/9 0.0 0.0 0.0 4.1 6.8 0.7 1.2 4.0 ENE 0.0 NaN NE 2.1 0.0 0.0
9 1980/1/10 2.0 1.0 0.0 6.9 10.9 1.4 3.3 9.0 WSW 0.0 NaN WSW 7.8 0.0 0.0
10 1980/1/11 0.0 0.0 0.0 4.6 7.2 2.1 2.0 4.0 NNE 0.0 NaN NNE 7.4 0.0 0.0
11 1980/1/12 0.0 0.0 0.0 5.3 8.7 1.9 2.2 4.0 SW 0.0 NaN NE 7.2 0.0 0.0
12 1980/1/13 22.0 3.0 0.0 3.2 5.2 1.5 2.0 3.0 NNE 0.0 NaN NNE 0.0 0.0 0.0
13 1980/1/14 0.0 0.0 0.0 7.5 10.9 3.0 4.7 8.0 WSW 0.0 NaN WSW 8.8 0.0 0.0
14 1980/1/15 0.0 0.0 0.0 6.2 9.3 2.0 4.5 7.0 WSW 0.0 NaN WSW 9.4 0.0 0.0
15 1980/1/16 0.0 0.0 0.0 6.6 8.6 4.0 6.0 10.0 WSW 0.0 NaN WSW 9.5 0.0 0.0
16 1980/1/17 0.0 0.0 0.0 7.3 10.5 5.0 7.3 11.0 SW 0.0 NaN WSW 8.5 0.0 0.0
17 1980/1/18 0.0 0.0 0.0 3.7 8.8 -0.2 1.6 4.0 NW 0.0 NaN NNE 8.5 0.0 0.0
18 1980/1/19 0.0 0.0 0.0 5.9 9.0 1.3 2.3 6.0 SSW 0.0 NaN SW 8.0 0.0 0.0
19 1980/1/20 0.0 0.0 0.0 9.8 11.6 7.7 5.8 9.0 WSW 0.0 NaN WSW 5.6 0.0 0.0
20 1980/1/21 0.0 0.0 0.0 5.8 9.7 1.9 3.9 8.0 WSW 0.0 NaN WSW 8.0 0.0 0.0
21 1980/1/22 0.0 0.0 0.0 3.1 7.8 -0.2 1.1 3.0 NE 0.0 NaN NNE 7.1 0.0 0.0
22 1980/1/23 0.0 0.0 0.0 5.0 8.6 0.0 2.1 6.0 WSW 0.0 NaN WSW 9.7 0.0 0.0
23 1980/1/24 0.0 0.0 0.0 7.0 10.5 2.4 2.6 5.0 NE 0.0 NaN NNE 7.3 0.0 0.0
24 1980/1/25 0.0 0.0 0.0 4.6 8.9 1.0 1.8 4.0 NE 0.0 NaN NNE 9.6 0.0 0.0
25 1980/1/26 0.0 0.0 0.0 6.3 11.5 1.1 1.3 3.0 ENE 0.0 NaN NNE 9.7 0.0 0.0
26 1980/1/27 0.0 0.0 0.0 9.5 14.2 1.6 3.2 7.0 SW 0.0 NaN SW 8.7 0.0 0.0
27 1980/1/28 57.0 15.0 0.0 13.9 16.9 12.4 2.5 5.0 SW 0.0 NaN SW 1.4 0.0 0.0
28 1980/1/29 12.0 6.0 0.0 9.2 11.1 7.9 3.2 5.0 NE 0.0 NaN NNE 0.0 0.0 0.0
29 1980/1/30 30.0 5.0 0.0 7.4 9.9 5.7 2.5 4.0 NNE 0.0 NaN NNE 0.3 0.0 0.0
30 1980/1/31 0.0 0.0 0.0 7.9 11.0 4.2 3.8 10.0 WSW 0.0 NaN WSW 8.5 0.0 0.0
In [5]:
# 全データを結合
df = pd.DataFrame(columns=df.columns)

for csv_file in tqdm(csv_files):
    input_path = os.path.join(INPUT_DIR, csv_file)
    tmp_df = pd.read_csv(input_path, parse_dates=['date'])
    df = pd.concat([df, tmp_df])
df = df.fillna('').reset_index(drop=True)
100%|██████████| 486/486 [00:03<00:00, 129.05it/s]
In [6]:
# 風向きはいらない
direction_cols = ['wind_velocity_max_direction', 'maximum_instantaneous_wind_speed_direction', 'most_common_wind_direction']
train_cols = [col for col in df.columns.tolist() if not (col in direction_cols)]
df = df[train_cols]
In [7]:
train_size = int(len(df)*0.85)
test_size = len(df)-train_size
train,test = df.iloc[0:train_size],df.iloc[train_size:len(df)]

display(train.tail(5))
display(test.head(5))

print(train.shape,test.shape)

feature_cols = train.drop('date', axis=1).columns.tolist()
date precipitation_sum precipitation_max_hour precipitation_max_10minute temperature_mean temperature_max temperature_min wind_velocity_mean wind_velocity_max maximum_instantaneous_wind_speed solar_irradiation_hour snowfall snow_accumulation
12568 2014-05-30 0.0 0.0 0.0 21.0 25.5 17.6 3.5 6.5 10.9 11.3 0.0 0.0
12569 2014-05-31 0.0 0.0 0.0 21.5 26.0 18.9 3.4 5.6 8.9 12.6 0.0 0.0
12570 2014-06-01 0.0 0.0 0.0 21.0 25.6 17.4 2.5 4.6 7.1 10.0 0.0 0.0
12571 2014-06-02 0.0 0.0 0.0 21.8 27.6 17.2 1.9 4.2 6.4 10.1 0.0 0.0
12572 2014-06-03 0.0 0.0 0.0 21.4 27.1 17.7 2.2 4.7 7.7 6.9 0.0 0.0
date precipitation_sum precipitation_max_hour precipitation_max_10minute temperature_mean temperature_max temperature_min wind_velocity_mean wind_velocity_max maximum_instantaneous_wind_speed solar_irradiation_hour snowfall snow_accumulation
12573 2014-06-04 0.0 0.0 0.0 21.7 27.0 18.6 2.2 5.8 8.9 6.0 0.0 0.0
12574 2014-06-05 16.0 4.5 1.5 19.7 21.1 17.6 3.0 7.5 12.3 0.0 0.0 0.0
12575 2014-06-06 134.0 20.5 6.0 19.6 20.6 18.4 3.4 5.5 9.8 0.0 0.0 0.0
12576 2014-06-07 70.5 25.0 7.0 18.2 19.2 17.6 6.1 8.2 13.7 0.0 0.0 0.0
12577 2014-06-08 0.0 0.0 0.0 20.5 24.2 17.7 3.2 6.3 11.2 1.4 0.0 0.0
(12573, 13) (2219, 13)

可視化

超大雑把に可視化しておく。残念ながら積雪や降雪の記録は取れていないらしい。三浦地方は大雪は少ないから異常がわかりやすいと思ったのに。。。
pandas-profilingがものすごくリッチに成長していてビックリ。先述の通り相関が高いものを警告したりしてくれてる。

In [8]:
for col in feature_cols:
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df['date'],y=df[col],mode='lines',name=col))
    fig.update_layout(showlegend=True)
    fig.show()
In [9]:
import pandas_profiling as pdp
pdp.ProfileReport(df)



Out[9]:

正規化

標準化、正規化は悩みどころだけど、あとで各カラムの予測値と実測値の差分を足し合わせる関係で正規化にしておく

In [10]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_train = train[feature_cols]
X_test = test[feature_cols]

scaler = MinMaxScaler()
scaler = scaler.fit(X_train)

X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)
X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])

モデル生成

あまり深くする必要もないのだが、せっかくなので多少は積んでおく。推論が遅くて使えなかったら考える。

In [11]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input,LSTM,Dense,Dropout,RepeatVector,TimeDistributed,BatchNormalization
from tensorflow.keras import regularizers

def build_lstm_ae(X_train):
    inputs = Input(shape=(X_train.shape[1], X_train.shape[2]))
    x = LSTM(256, activation='relu', return_sequences=True, 
             kernel_regularizer=regularizers.l2(0.00))(inputs)
#     x = Dropout(0.1)(x)
    x = BatchNormalization()(x)
    x = LSTM(128, activation='relu', return_sequences=True)(x)
#     x = Dropout(0.1)(x)
    x = BatchNormalization()(x)
    x = LSTM(64, activation='relu', return_sequences=True)(x)
#     x = Dropout(0.1)(x)
    x = BatchNormalization()(x)
    x = LSTM(32, activation='relu', return_sequences=True)(x)
#     x = Dropout(0.1)(x)
    x = BatchNormalization()(x)
    x = LSTM(16, activation='relu', return_sequences=False)(x)
#     x = Dropout(0.1)(x)
    x = BatchNormalization()(x)
    x = RepeatVector(X_train.shape[1])(x)
    x = LSTM(16, activation='relu', return_sequences=True)(x)
#     x = Dropout(0.1)(x)
    x = BatchNormalization()(x)
    x = LSTM(32, activation='relu', return_sequences=True)(x)
#     x = Dropout(0.1)(x)
    x = BatchNormalization()(x)
    x = LSTM(64, activation='relu', return_sequences=True)(x)
#     x = Dropout(0.1)(x)
    x = BatchNormalization()(x)
    x = LSTM(128, activation='relu', return_sequences=True)(x)
#     x = Dropout(0.1)(x)
    x = BatchNormalization()(x)
    x = LSTM(256, activation='relu', return_sequences=True)(x)
    output = TimeDistributed(Dense(X_train.shape[2]))(x)    
    model = Model(inputs=inputs, outputs=output)
    return model
In [12]:
model = build_lstm_ae(X_train)
model.compile(loss='mae',optimizer='adam',metrics=['accuracy'])
model.summary()
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 1, 12)]           0         
_________________________________________________________________
lstm (LSTM)                  (None, 1, 256)            275456    
_________________________________________________________________
batch_normalization (BatchNo (None, 1, 256)            1024      
_________________________________________________________________
lstm_1 (LSTM)                (None, 1, 128)            197120    
_________________________________________________________________
batch_normalization_1 (Batch (None, 1, 128)            512       
_________________________________________________________________
lstm_2 (LSTM)                (None, 1, 64)             49408     
_________________________________________________________________
batch_normalization_2 (Batch (None, 1, 64)             256       
_________________________________________________________________
lstm_3 (LSTM)                (None, 1, 32)             12416     
_________________________________________________________________
batch_normalization_3 (Batch (None, 1, 32)             128       
_________________________________________________________________
lstm_4 (LSTM)                (None, 16)                3136      
_________________________________________________________________
batch_normalization_4 (Batch (None, 16)                64        
_________________________________________________________________
repeat_vector (RepeatVector) (None, 1, 16)             0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 1, 16)             2112      
_________________________________________________________________
batch_normalization_5 (Batch (None, 1, 16)             64        
_________________________________________________________________
lstm_6 (LSTM)                (None, 1, 32)             6272      
_________________________________________________________________
batch_normalization_6 (Batch (None, 1, 32)             128       
_________________________________________________________________
lstm_7 (LSTM)                (None, 1, 64)             24832     
_________________________________________________________________
batch_normalization_7 (Batch (None, 1, 64)             256       
_________________________________________________________________
lstm_8 (LSTM)                (None, 1, 128)            98816     
_________________________________________________________________
batch_normalization_8 (Batch (None, 1, 128)            512       
_________________________________________________________________
lstm_9 (LSTM)                (None, 1, 256)            394240    
_________________________________________________________________
time_distributed (TimeDistri (None, 1, 12)             3084      
=================================================================
Total params: 1,069,836
Trainable params: 1,068,364
Non-trainable params: 1,472
_________________________________________________________________
In [13]:
# AuroEncoderなのでラベルは不要。特徴量とラベルがおんなじって感じ。仮にノイズ除去モデルを作りたいなら第一引数をノイズを付与したX_trainにする
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, mode='min')
# sb = tf.keras.callbacks.ModelCheckpoint("tf_model.hdf5", monitor="val_loss", verbose=1, save_best_only=True, save_weights_only=False)
history = model.fit(
    X_train, X_train, epochs=100,
    batch_size=256, validation_split=0.1, callbacks=[es], shuffle=False
)
Epoch 1/100
45/45 [==============================] - 2s 46ms/step - loss: 0.0891 - accuracy: 0.3874 - val_loss: 0.2281 - val_accuracy: 0.1367
Epoch 2/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0466 - accuracy: 0.6019 - val_loss: 0.2193 - val_accuracy: 0.1367
Epoch 3/100
45/45 [==============================] - 1s 22ms/step - loss: 0.0400 - accuracy: 0.6370 - val_loss: 0.2089 - val_accuracy: 0.1025
Epoch 4/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0369 - accuracy: 0.6491 - val_loss: 0.1984 - val_accuracy: 0.1025
Epoch 5/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0356 - accuracy: 0.6597 - val_loss: 0.1852 - val_accuracy: 0.1025
Epoch 6/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0329 - accuracy: 0.6855 - val_loss: 0.1702 - val_accuracy: 0.1025
Epoch 7/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0319 - accuracy: 0.6832 - val_loss: 0.1581 - val_accuracy: 0.1025
Epoch 8/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0301 - accuracy: 0.6933 - val_loss: 0.1463 - val_accuracy: 0.1025
Epoch 9/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0318 - accuracy: 0.6977 - val_loss: 0.1382 - val_accuracy: 0.1359
Epoch 10/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0316 - accuracy: 0.7019 - val_loss: 0.1346 - val_accuracy: 0.1852
Epoch 11/100
45/45 [==============================] - 1s 22ms/step - loss: 0.0274 - accuracy: 0.7119 - val_loss: 0.1281 - val_accuracy: 0.2083
Epoch 12/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0264 - accuracy: 0.7152 - val_loss: 0.1169 - val_accuracy: 0.2782
Epoch 13/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0276 - accuracy: 0.7213 - val_loss: 0.1061 - val_accuracy: 0.3959
Epoch 14/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0267 - accuracy: 0.7285 - val_loss: 0.0935 - val_accuracy: 0.5580
Epoch 15/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0253 - accuracy: 0.7281 - val_loss: 0.0842 - val_accuracy: 0.5692
Epoch 16/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0246 - accuracy: 0.7294 - val_loss: 0.0675 - val_accuracy: 0.6645
Epoch 17/100
45/45 [==============================] - 1s 23ms/step - loss: 0.0246 - accuracy: 0.7287 - val_loss: 0.0612 - val_accuracy: 0.6852
Epoch 18/100
45/45 [==============================] - 1s 27ms/step - loss: 0.0245 - accuracy: 0.7326 - val_loss: 0.0549 - val_accuracy: 0.6868
Epoch 19/100
45/45 [==============================] - 1s 24ms/step - loss: 0.0250 - accuracy: 0.7261 - val_loss: 0.0561 - val_accuracy: 0.6773
Epoch 20/100
45/45 [==============================] - 1s 22ms/step - loss: 0.0266 - accuracy: 0.7275 - val_loss: 0.0573 - val_accuracy: 0.6852
Epoch 21/100
45/45 [==============================] - 1s 22ms/step - loss: 0.0289 - accuracy: 0.7262 - val_loss: 0.0571 - val_accuracy: 0.6709
Epoch 22/100
45/45 [==============================] - 1s 22ms/step - loss: 0.0252 - accuracy: 0.7244 - val_loss: 0.0605 - val_accuracy: 0.6598
Epoch 23/100
45/45 [==============================] - 1s 21ms/step - loss: 0.0235 - accuracy: 0.7367 - val_loss: 0.0562 - val_accuracy: 0.6955
In [14]:
model.save('tf_model.hdf5')
In [15]:
history.history.keys()

# fig, axes = plt.subplots(nrows=1, ncols=2, sharex=False)

# axes[0].plot(history.history['loss'],label='Training Loss')
# axes[0].plot(history.history['val_loss'],label='Validation Loss')
# axes[1].plot(history.history['accuracy'],label='Training ACC')
# axes[1].plot(history.history['val_accuracy'],label='Validation ACC')
# plt.legend()
Out[15]:
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
In [16]:
plt.plot(history.history['loss'],label='Training Loss')
plt.plot(history.history['val_loss'],label='Validation Loss')
plt.legend()
Out[16]:
<matplotlib.legend.Legend at 0x7f01798e9a20>
In [17]:
plt.plot(history.history['accuracy'],label='Training ACC')
plt.plot(history.history['val_accuracy'],label='Validation ACC')
plt.legend()
Out[17]:
<matplotlib.legend.Legend at 0x7f01790e2ef0>
In [18]:
model.evaluate(X_test,X_test)
70/70 [==============================] - 0s 4ms/step - loss: 0.0621 - accuracy: 0.6147
Out[18]:
[0.06212100014090538, 0.6146913170814514]

異常検知

AutoEncoderなので予測値は実際の気象情報を変換して復元した値になる。復元した値と実際の値の差分を取って、差分が大きいところを異常とする。
気象情報なので多少は異常があると仮定しているが、仮に本物のセンサデータで異常発生時のデータが無い場合は差分の平均との差分が+3σより大きかったら異常とか、観測データの差分の最大値+αを超えたら異常って扱いにするのかな。
差分はMAEとMSEで見ているが、正規化しているから差分は1未満になるのでMAEよりMSEのほうが顕著に出ているように見える。

BatchNorm入れたらかなり再現力が高くなって分布が見づらくなった。。。分布図をLogスケールで見る方法って無いのかな。。。

In [19]:
# plot the loss distribution of the training set
X_pred = model.predict(X_train)
X_pred = X_pred.reshape(X_pred.shape[0], X_pred.shape[2])
X_pred = pd.DataFrame(X_pred, columns=feature_cols)
X_pred.index = train.index

scored = pd.DataFrame(index=train.index)
Xtrain = X_train.reshape(X_train.shape[0], X_train.shape[2])
scored['Loss_mae'] = np.mean(np.abs(X_pred-Xtrain), axis = 1)
scored['Loss_mse'] = np.mean(np.power(X_pred-Xtrain, 2), axis=1)

plt.figure(figsize=(16,9), dpi=80)
plt.title('Loss Distribution', fontsize=16)
sns.distplot(scored['Loss_mae'], bins=100, kde=True, color='red', label='mae')
sns.distplot(scored['Loss_mse'], bins=100, kde=True, color='blue', label='mse')
plt.xlim([0.0,max(scored['Loss_mae'].max(), scored['Loss_mse'].max())])
Out[19]:
(0.0, 37.041164406346624)
In [20]:
scored
Out[20]:
Loss_mae Loss_mse
0 0.019893 0.000754
1 0.015746 0.000460
2 0.038887 0.002783
3 0.034373 0.002309
4 0.020103 0.000779
... ... ...
12568 0.037056 0.006570
12569 0.035166 0.004674
12570 0.025192 0.002598
12571 0.020814 0.002005
12572 0.032959 0.003726

12573 rows × 2 columns

In [21]:
X_pred = model.predict(X_test)
X_pred = X_pred.reshape(X_pred.shape[0], X_pred.shape[2])
X_pred = pd.DataFrame(X_pred, columns=feature_cols)
X_pred.index = test.index

scored = pd.DataFrame(index=test.index)
Xtest = X_test.reshape(X_test.shape[0], X_test.shape[2])

scored['mae'] = np.mean(np.abs(X_pred-Xtest), axis = 1)
scored['mse'] = np.mean(np.power(X_pred-Xtest, 2), axis = 1)
plt.figure(figsize=(16,9), dpi=80)
plt.title('Test Error Distribution', fontsize=16)
sns.distplot(scored['mae'], bins=100, kde=True, color='red', label='mae')
sns.distplot(scored['mse'], bins=100, kde=True, color='blue', label='mse')
plt.xlim([0.0,max(scored['mae'].max(), scored['mse'].max())])
Out[21]:
(0.0, 16.761199340203188)
In [22]:
X_test_pred = model.predict(X_test)
X_test_pred = X_test_pred.reshape(X_test_pred.shape[0], X_test_pred.shape[2])

Xtest = X_test.reshape(X_test.shape[0], X_test.shape[2])
mae = np.mean(np.abs(X_test_pred-Xtest), axis = 1)
mse = np.mean(np.power(X_test_pred-Xtest, 2), axis = 1)

test_df = test.copy()
test_df['mse'] = mse
test_df['mae'] = mae
In [23]:
# 全体のthreshold*100 % は正常値として扱い、それ以外を異常とする。そのための閾値を計算する。
threshold = 0.998
mse_threshold = np.quantile(mse, threshold)
mae_threshold = np.quantile(mae, threshold)
print('MSE {0} threshold:{1}'.format(threshold, mse_threshold))
print('MAE {0} threshold:{1}'.format(threshold, mae_threshold))
MSE 0.998 threshold:4.297271796661005
MAE 0.998 threshold:1.677339384069937
In [24]:
display(test_df[test_df['mse'] > mse_threshold])
display(test_df[test_df['mae'] > mae_threshold])
date precipitation_sum precipitation_max_hour precipitation_max_10minute temperature_mean temperature_max temperature_min wind_velocity_mean wind_velocity_max maximum_instantaneous_wind_speed solar_irradiation_hour snowfall snow_accumulation mse mae
12697 2014-10-06 85.0 34.0 11.0 21.5 26.6 16.6 6.6 22.1 38.5 3.6 0.0 0.0 5.534084 2.114674
13775 2017-09-18 17.0 13.5 8.0 26.3 30.2 21.3 6.3 14.4 25.3 11.1 0.0 0.0 9.952166 2.449367
14127 2018-09-05 18.5 15.0 5.5 26.6 29.1 22.8 7.8 11.5 19.6 8.6 0.0 0.0 6.392217 1.881369
14153 2018-10-01 6.5 13.5 3.0 25.0 28.0 21.6 9.0 20.9 36.5 10.8 0.0 0.0 16.761199 2.985550
14496 2019-09-09 91.5 33.0 11.0 27.5 31.2 24.8 7.6 21.0 41.7 7.2 0.0 0.0 11.858973 2.702625
date precipitation_sum precipitation_max_hour precipitation_max_10minute temperature_mean temperature_max temperature_min wind_velocity_mean wind_velocity_max maximum_instantaneous_wind_speed solar_irradiation_hour snowfall snow_accumulation mse mae
12697 2014-10-06 85.0 34.0 11.0 21.5 26.6 16.6 6.6 22.1 38.5 3.6 0.0 0.0 5.534084 2.114674
13775 2017-09-18 17.0 13.5 8.0 26.3 30.2 21.3 6.3 14.4 25.3 11.1 0.0 0.0 9.952166 2.449367
14127 2018-09-05 18.5 15.0 5.5 26.6 29.1 22.8 7.8 11.5 19.6 8.6 0.0 0.0 6.392217 1.881369
14153 2018-10-01 6.5 13.5 3.0 25.0 28.0 21.6 9.0 20.9 36.5 10.8 0.0 0.0 16.761199 2.985550
14496 2019-09-09 91.5 33.0 11.0 27.5 31.2 24.8 7.6 21.0 41.7 7.2 0.0 0.0 11.858973 2.702625
In [25]:
test_df['mse_threshold'] = mse_threshold
fig = go.Figure()
fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mse'],mode='lines',name='mse'))
fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mse_threshold'],mode='lines',name='mse_threshold'))
fig.update_layout(showlegend=True)
fig.show()

test_df['mae_threshold'] = mae_threshold
fig = go.Figure()
fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mae'],mode='lines',name='mae'))
fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mae_threshold'],mode='lines',name='mae_threshold'))
fig.update_layout(showlegend=True)
fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mse'],mode='lines',name='mse'))
fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mse_threshold'],mode='lines',name='mse_threshold'))
fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mae'],mode='lines',name='mae'))
fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mae_threshold'],mode='lines',name='mae_threshold'))
fig.update_layout(showlegend=True)
fig.show()

実際にはリアルタイムに流れてくるデータを定期的に刈り取って予測すると思われるので、テストデータを定期的に区切って予測してみる。特に意味は無いけど、30件までしか一度に予測しない前提で行う。
あれ、頂点の値がさっきと一緒?LSTMってそういうものだっけ。時系列を見るから古いデータを捨てて予測したら別の値が出るものだと思ってたけど。。。

In [26]:
X_test.shape
Out[26]:
(2219, 1, 12)
In [27]:
model = tf.keras.models.load_model('tf_model.hdf5')
In [28]:
offset = 0
for i in range(2200):
    fetch_size = np.random.randint(4,10)
    offset = offset + fetch_size
    X_test_tmp = X_test[0:offset][-30:]
    
    X_test_pred_tmp = model.predict(X_test_tmp)
    X_test_pred_tmp = X_test_pred_tmp.reshape(X_test_pred_tmp.shape[0], X_test_pred_tmp.shape[2])

    Xtest_tmp = X_test_tmp.reshape(X_test_tmp.shape[0], X_test_tmp.shape[2])
    mae = np.mean(np.abs(X_test_pred_tmp-Xtest_tmp), axis = 1)
    mse = np.mean(np.power(X_test_pred_tmp-Xtest_tmp, 2), axis = 1)
    
    test_df = test[0:offset][-30:].copy()
    test_df['mse'] = mse
    
    test_df['mse_threshold'] = mse_threshold
    if test_df['mse'].max() > mse_threshold:
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mse'],mode='lines',name='mse'))
        fig.add_trace(go.Scatter(x=test_df['date'],y=test_df['mse_threshold'],mode='lines',name='mse_threshold'))
        fig.update_layout(showlegend=True)
        fig.show()
In [ ]:
 
In [ ]: